Model specification is an element of statistics whose importance is often glossed over. In most cases, there are multiple appropriate models for a set of data. The end choice of model can hugely impact the results, and classical methods offer limited guidance on the best process for accounting for the uncertainty this creates. Unstructured searches and checks for the best model specification can lead to incorrect inferences, fragile reported findings, and publication bias (Montgomery and Nyhan 2010). Frequentist approaches to model selection include but are not limited to using the Akaike Information Criterion (AIC) and using the root mean squared error (RMSE) or R^2. We will thus elaborate a little on AIC and two other common information criterion, BIC and WAIC.
Information Criterion are commonly used to select the ‘best’ model from a set or to understand how much better one model is compared to another. They can be thought of as trying to estimate how good the model is at fitting out of sample data, when no such data is available. All of these criterion consist of a likelihood component, which estimates how good the model is at fitting within sample data, and a bias corrector, to account for how models are usually better at fitting within sample data than out of sample data. For all information criterion, a lower score is better. They can only be directly compared to themselves. Note that the mathematical conventions for the following sections will be as such: \(L\) is the model’s likehood function, \(k\) is the number of parameters in the model, \(n\) is the number of data points, and \(M\) is a specific model.
The Akaike Information Criterion (AIC) is the simplest information criterion, and the most commonly used:
\[AIC_M = -2 ln(L) + 2k\]
Given multiple models, you can estimate the probability a model \(M\) minimizes information loss as such, where \(AIC_{min}\) is the lowest score in the set of models. AIC is traditionally used when a test set is not as feasible, such as with small datasets or with time series. AIC first fits model \(M\) to the training data, then regularizes by the complexity of the model, given \(k\) parameters. It is thus the models with the lowest AIC score than have the best balance of data fit and parameter simplicity.
The AIC can be used to obtain a probabilistic measure of information maximization, by exponentiating the AIC scores for a particular model compared to the minimum AIC score of the set of models, \(AIC_{min}\). This provides a probability \(p_M\) that a particular model minimizes information loss, relative to all the other models tested. Understandably, using the model with \(AIC_{min}\) returns \(p_M = 1\).
\[ p_M = \frac{exp\left(-\frac{1}{2}(AIC_{min} - AIC_k)\right)}{\sum_{i=1}^K exp\left(-\frac{1}{2}(AIC_{min} - AIC_k)\right)} \]
AIC fits well in the freqentist frame. The likelihood component is assumed to come from the maximum likelihood estimates of paramters, which Bayesian methods do not select for, and point estimates for parameters are used, instead of using the full posterior distribution. (Zajic 2019)
The misleadingly named Bayesian Information Criterion (BIC), is given by:
\[BIC = - 2ln(L) + k\;ln(n)\] Where \(k\) is the number of parameters in the model, \(n\) is the number of data points, and \(L\) is again the likelihood function.
BIC penalizes free parameters more than the AIC when data sets are large. While AIC tries to select the best model that describes the data presented, the BIC attempts to select the true model from among a model set. However, like AIC, BIC doesn’t naturally fit with the Bayesian framework as it is scored by the likelihood function. (Statistics How To 2020)
One information criterion that fits naturally within the Bayesian framework is the Widely Applicable Information Criterion (WAIC):
\[WAIC = -2\sum_{i=1}^nlog\left(\frac{1}{S}\sum_{s=1}^S \pi(y_i\vert\theta^s) \right) +2\sum_{i=1}^nV_{s=1}^S(log(\pi(y_i\vert\theta^s)))\]
Where \(S\) is the number of posterior draws, \(\theta^s\) is the s-th posterior draw for parameters \(theta\), and \(V^S_{s=1}(a_s)\) is the sample variance of \(a_s\).
The first half of WAIC uses the posterior distribution of \(\theta\), as opposed to only likelihood estimates for theta as are used in AIC and BIC. This means that the WAIC is fully Bayesian. The second half is the bias corrector which can be thought of as the number of unconstrained parameters accounting for the complex ways Bayesian parameters interact. (Watanbe 2013)
Bayesian Model Averaging (BMA) offers an alternative to information criterion that helps ensure findings are robust to a variety of model specifications. At its simplest level, BMA assigns priors to potential model specifications and then caluclates posterior distributions for the model itself, in addition to the coefficients within the specification. This is thus an extension of previous Bayesian theory that focuses solely on coefficient estimation.
Let us begin by considering a matrix \(X\) of all the \(n \times p\) potential independent variables to predict a response variable \(Y\). A standard linear analysis would assume \(Y = X \beta + \epsilon\), where \(\beta\) is a coefficient matrix and \(\epsilon\) ~ \(N(0, \sigma^2)\). There are \(q=2^p\) potential model specifications from the model space \(\{M_1, M_2, ...M_q\}\), and in many cases, there is ambiguity about which of these is best. BMA incorporates this uncertainty into the process rather than ignoring it and claiming that the final model is the only option. This leads to greater flexibility in the inferences of the end results (Montgomery and Nyhan 2010).
Each model \(M_k\) encompasses the likelihood function \(L(Y|\beta_k, M_k)\) of the observed data \(Y\) in terms of a model-specific coefficient matrix \(\beta_k\) with a prior \(\pi(\beta_k|M_k)\). Both the likelihood and priors are conditional on a particular model (Fragoso, Bertoli, and Louzada 2018). The posterior distribution for the model parameters is then \[\pi(\beta_k|Y, M_k) = \frac{L(Y|\beta_k, M_k) \; \pi(\beta_k | M_k)}{\int L(Y|\beta_k, M_k) \; \pi(\beta_k | M_k) \; d\beta_k}\]
The above denominator represents the marginal distribution of the observations over all paramteter values specified in model \(M_k\). It is called the model’s marginal likelihood or model evidence and is denoted by \(\pi(Y|M_k)\). BMA now assumes a prior distribution over the model space describing the prior uncertainty over each model’s capability to accurately describe and/or predict the data. This is modeled as a probability density over the model space, with values \(\pi(M_k)\) for \(k = 1, 2, ... q\) (Fragoso, Bertoli, and Louzada 2018).
Then, the posterior probability of model specification \(M_k\) is \[\pi(M_k | Y) = \frac{L(M_k | Y) \; \pi(M_k)}{\sum_{k=0}^q \; \pi(Y|M_k)\; \pi(M_k)}\]. (Yao et al. 2018)
Traditional BMA is not without issues. For example, changing priors from one vague prior to another can significantly impact posterior model probablities. In addition, calculating model likelihoods becomes extremely difficult for anything but the simplest models.
As a result of these issues, a method called Pseudo-BMA was created. This has the same core idea as BMA, but instead of calculating model likelhoods, something called Leave-One-Out cross validation (LOO-CV) is used.
As a result of these issues, a method called Pseudo-BMA was created. This has the same core idea as BMA, but instead of calculating model likelhoods, something called Leave-One-Out cross validation (LOO-CV) is used (Yao et al. 2018).
Similar to the information criterion, LOO-CV estimates how good a model is at predicting out of sample data. To do this, it fits the model on all but one data points, then checks how well it predicts the missing data point. This is then repeated for every data point in the sample.
Put mathematically given S simulation draws, this is:
\[ \sum_{i=1}^nlog(\pi(y_i\vert y_1,\dots y_{i-1},y_{i+1},y_n)) \]
Where the term in the \(log\) is the probability of seeing \(y_i\) from a model trained on all the data excluding \(y_i\).
By adding a complicated bias term, this can be turned into an information criterion. Both AIC and WAIC will approach LOO-CV as sample size increases.
While exact LOO requires \(n\) iterations, one for each point in \(y\), Pseudo-BMA reduces computational complexity by taking samples from the posterior distribution. Using a technique called Pareto Smoothed Importance Sampling (PSIS), LOO-CV is estimated. This method also allows Pseudo-BMA to be used on models with parameters that have already been fitted.
Given a dataset \(y\), models \(M_k\), weights \(w_{i,k}^s\), and parameters \(\theta\), PSIS-LOO is an efficent way to approximate the estimate of the expected log pointwise predictive density for model k, which is the measure of how good model k is at fitting out of sample data:
\[ \widehat{elpd}^k=\sum_{i=1}^n log(\hat{p} (y_i | y_1,\dots y_{i-1},y_{i+1},y_n, M_K)) \]
Where conditioning on \(M_k\) means using that model for parameter estimates.
This is similar to model probabilities using AIC, model probabilities are calculated by:
\[ w_k = \frac{exp(\widehat{elpd}^k)}{\sum_{k=1}^Kexp(\widehat{elpd}^k)} \]
In practice, one more step involving bootstrapping is used to deal with additional bias in the estimate. This final model weight \(w_k\) is the approximate probability of model k being the true model.
These model probabilities can be used in a variety of ways. For example, BMA allows for a direct combination of models to calculate combined estimations for parameters, which leads to lower risk predictions than a single model (Fragoso, Bertoli, and Louzada 2018). In addition, BMA can be used in model selection by choosing the model with the highest posterior probability. We will focus on this latter application.
library(ggplot2)
library(dplyr)
library(rstan)
library(rstanarm)
library(bayesplot)
library(loo)
library(tidyr)
library(ggridges)
library(purrr)
library(gridExtra)
library(knitr)
library(formattable)
library(tidyverse)
library(tidyr)
library(wesanderson)
theme_set(theme_minimal())
# prediction_summary function
prediction_summary_data <- function(y, yrep, prob_inner = 0.5, prob_outer = 0.95){
# Calculate summary statistics of simulated
# posterior predictive models for each case
l_outer <- function(x){quantile(x, (1-prob_outer) / 2)}
l_inner <- function(x){quantile(x, (1-prob_inner) / 2)}
u_inner <- function(x){quantile(x, 1 - (1-prob_inner) / 2)}
u_outer <- function(x){quantile(x, 1 - (1-prob_outer) / 2)}
df <- data.frame(yrep) %>%
summarize_all(list(mean, sd, median, mad, l_outer, l_inner, u_inner, u_outer)) %>%
unlist() %>%
matrix(., length(y), 8) %>%
data.frame()
names(df) <- c("post_mean", "post_sd", "post_median", "post_mad", "l_outer", "l_inner", "u_inner", "u_outer")
data.frame(cbind(y, df))
}
prediction_summary <- function(y, yrep, prob_inner = 0.5, prob_outer = 0.95){
# This function summarizes the predictions across all cases
pred_data <- prediction_summary_data(y, yrep, prob_inner = prob_inner, prob_outer = prob_outer) %>%
mutate(error = y - post_median) %>%
mutate(error_scaled = error / post_mad) %>%
mutate(within_inner = (y >= l_inner) & (y <= u_inner)) %>%
mutate(within_outer = (y >= l_outer) & (y <= u_outer))
pred_summary <- pred_data %>%
summarize(mae = median(abs(error)),
mae_scaled = median(abs(error_scaled)),
within_inner = mean(within_inner),
within_outer = mean(within_outer)
)
names(pred_summary)[3] <- paste0("within_", prob_inner*100)
names(pred_summary)[4] <- paste0("within_", prob_outer*100)
pred_summary
}
To illustrate Pseudo-BMA in practice, we’ve chosen the famous candy dataset from the fivethirtyeight package. This dataset was generated by pitting 86 Halloween candies against each other, and letting a crowd vote on the winner. In the end, 8371 IP addresses voted on 269000 randomly generated pairs of candy. The win percentages for each candy was then recorded, along with some additional data. The variables included in the dataset are defined below.
| Variable | Definition |
|---|---|
competitorname |
The name of the candy |
| True/False variables | |
chocolate |
Does it contain chocolate? |
fruity |
Is it fruit flavored? |
caramel |
Is there caramel in the candy? |
peanutyalmondy |
Does it contain any of the following: peanuts, peanut butter, or almonds? |
nougat |
Does it contain nougat? |
crispedricewafer |
Does it contain any of the following: crisped rice, wafers, or a cookie component? |
hard |
Is it a hard candy? |
bar |
Is it a candy bar? |
pluribus |
Is it one of many candies in a bag or box? |
| Continuous variables | |
sugarpercent |
The percentile of sugar it falls under within the data set |
pricepercent |
The unit price percentile compared to the rest of the set |
winpercent |
The overall win percentage according to the 269,000 matchups |
We will be modeling the effect of the above variables on the win percentage in order to illustrate Pseduo-BMA. Before moving into the modeling process, however, it is helpful to explore the data and build intuition about what results we might expect. Below, the top 20 candies by win percentage are plotted.
candy %>%
head(20) %>%
ggplot() +
geom_col(aes(y = reorder(competitorname, winpercent), x = winpercent, fill=competitorname)) +
labs(x = "Percent of Head to Heads Won", y = NULL, title = "Top 20 Candies") +
scale_fill_manual(values = pal) +
theme(legend.position = "none")
3 Musketeers and 100 Grand are a closely-matched top two. Is the name alone responsible for their success, due to brand recognition, loyalty, or some unique component? Or, is there some other variable driving their success - for example type, sugar content, or price? We first investigate the effect of type. Below is a plot of how many candies are each type, followed by the win percentage for each type of candy.
candy %>%
summarise_if(is.logical, sum) %>%
pivot_longer(1:ncol(.), names_to = "type", values_to = "count") %>%
mutate(type_clean = c("Chocolate", "Fruity", "Caramel", "Peanuty or Almondy", "Nougat", "Crisped Rice Wafer", "Hard", "Bar", "Several Candies in One Bag")) %>%
ggplot(aes(y = reorder(type_clean, count), x = count, fill=type_clean)) +
geom_col() +
labs(y = NULL, x = NULL, title = "Number of Candies with each Attribute") +
scale_fill_manual(values = pal) +
theme(legend.position = "none")
candy %>%
select(winpercent, 2:10) %>%
pivot_longer(2:10, names_to = "type", values_to = "is_type") %>%
filter(is_type) %>%
select(-is_type) %>%
group_by(type) %>%
mutate(mean_win = mean(winpercent)) %>%
ungroup() %>%
ggplot(aes(x = winpercent, y = reorder(type, mean_win), fill=type, height = stat(density))) +
geom_density_ridges(stat = "binline", bins = 20, scale = 0.95, draw_baseline = FALSE) +
labs(x = "Percent of Head to Heads Won", y = NULL, title = "Candy Type by Win Percent") +
scale_y_discrete(labels = rev(c("Crisped Rice Wafer", "Peanuty or Almondy", "Bar", "Chocolate", "Nougat", "Caramel", "Several Candies in One Bag", "Fruity", "Hard"))) +
scale_fill_manual(values = pal) +
theme(legend.position = "none")
We can see that having several candies in each bag, being fruity, and being chocolate are the most common characteristics. However, it appears that these are not necessarily the most popular. While popularity is not clearly stratified by type, in general peanuty or almondy candy and crisped rice or wafer candy seem to be at the top, while hard candy and fruity candy lean more towards the bottom. As type of candy is not obviously a significant predictor, we next explore the effect of sugar content on the win percentage, plotted below.
candy %>%
ggplot(aes(x = sugarpercent, y = winpercent)) +
geom_jitter() +
labs(x = "Sugar Content (Percent)", y = "Percent of Head to Heads Won", title = "Sugar Content versus Win Rate") +
ylim(0, 100) +
scale_fill_manual(values = pal) +
theme(legend.position = "none")
Once again, there is not an obvious relationship between the sugar content and the win percentage. This means that we would not expect sugar content to be an important predictor in the best model - even if it is statistically significant, it will likely not be practically significant.
Finally, we look at how the price affects the win percentage.
candy %>%
ggplot(aes(x = pricepercent, y = winpercent)) +
geom_jitter() +
labs(x = "Relative Price", y = "Percent of Head to Heads Won", title = "Price versus Win Rate") +
scale_fill_manual(values = pal) +
theme(legend.position = "none")
As before, relative price is not a clear predictor of success. However, this may be due to how the data was collected - remember that this data shows people’s favorite candy, not people’s most commonly purchased candy. Price may be an important predictor of what they buy, but not what they prefer.
Now that we have built some intutition of what variables affect win percentage, it is time to officially run some models. We start by creating four regular stan models. In this case, each model differs by the number of variables considered. Model 1 uses only sugar content as a predictor variable, Model 2 looks at both sugar content and relative price, Model 3 includes each of the binary type variables and no continuous variables, while Model 4 uses all of the possible explanatory variables. The posterior distributions for each of these models are approximated via Monte Carlo Markov Chains (MCMC).
set.seed(454)
model_1 <- stan_glm(winpercent ~ sugarpercent,
data = candy,
family = gaussian,
chains = 4, iter = 2*5000, refresh = FALSE)
model_2 <- stan_glm(winpercent ~ sugarpercent + pricepercent,
data = candy,
family = gaussian,
chains = 4, iter = 2*5000, refresh = FALSE)
model_3 <- stan_glm(winpercent ~ chocolate + fruity + caramel + peanutyalmondy + nougat + crispedricewafer + hard + bar + pluribus,
data = candy,
family = gaussian,
chains = 4, iter = 2*5000, refresh = FALSE)
model_4 <- stan_glm(winpercent ~ chocolate + fruity + caramel + peanutyalmondy + nougat + crispedricewafer + hard + bar + pluribus + sugarpercent + pricepercent,
data = candy,
family = gaussian,
chains = 4, iter = 2*5000, refresh = FALSE)
model_list <- list(model_1 = model_1, model_2 = model_2, model_3 = model_3, model_4 = model_4)
Now that we have the models, we can use tools such as trace plots and density overlays to assess their stability. The trace plot and overlay for model 1 is shown below:
mcmc_trace(model_1)
mcmc_dens_overlay(model_1)
While the trace plots show a fairly wide band, indicating some variation in the values used for each variable, no flatlining is present. We can also see that all models returned similar distributions for each of the parameters in model 1 (Intercept, sugarpercent, sigma). Similar behavior is seen in all the trace plots and density overlays for each model (see Appendix), thus the models appear usable.
To assess if the structure of our models is reasonable, we can use pp_check() to compare simulated samples to the real values.
grid.arrange(pp_check(model_1) + labs(title = 'Model 1'), pp_check(model_2) + labs(title = 'Model 2'),
pp_check(model_3) + labs(title = 'Model 3'), pp_check(model_4) + labs(title = 'Model 4'))
All models seem to roughly fit the distribution of the data. There are a few noticeable outliers in every model, and each model appears to have a higher peak than the distribution of the actual data, but in general we do not consider these wrong models.
As we are satsified, we can proceed with these models. The coefficients are printed below for Models 1-4, respectively.
mod1 <- as.data.frame(model_1$coefficients)
formattable(mod1, align=c("l"), col.names=c("Coefficient"), title="Model 1")
| Coefficient | |
|---|---|
| (Intercept) | 44.67007 |
| sugarpercent | 11.81196 |
mod2 <- as.data.frame(model_2$coefficients)
formattable(mod2, align=c("l"), col.names=c("Coefficient"))
| Coefficient | |
|---|---|
| (Intercept) | 39.802262 |
| sugarpercent | 6.722496 |
| pricepercent | 15.580509 |
mod3 <- as.data.frame(model_3$coefficients)
formattable(mod3, align=c("l"), col.names=c("Coefficient"))
| Coefficient | |
|---|---|
| (Intercept) | 35.2393393 |
| chocolateTRUE | 19.6522822 |
| fruityTRUE | 10.0194746 |
| caramelTRUE | 3.3625741 |
| peanutyalmondyTRUE | 9.9863058 |
| nougatTRUE | 2.3087578 |
| crispedricewaferTRUE | 8.8676580 |
| hardTRUE | -4.8126445 |
| barTRUE | -0.5306398 |
| pluribusTRUE | -0.1460981 |
mod4 <- as.data.frame(model_4$coefficients)
formattable(mod4, align=c("l"), col.names=c("Coefficient"))
| Coefficient | |
|---|---|
| (Intercept) | 34.7283883 |
| chocolateTRUE | 19.5438496 |
| fruityTRUE | 9.1570800 |
| caramelTRUE | 2.2049288 |
| peanutyalmondyTRUE | 9.9400899 |
| nougatTRUE | 0.7589168 |
| crispedricewaferTRUE | 8.7610339 |
| hardTRUE | -6.0805383 |
| barTRUE | 0.5256453 |
| pluribusTRUE | -0.8447881 |
| sugarpercent | 9.1226100 |
| pricepercent | -5.7884718 |
Once the models are calculated, we calcuate Leave One Out expected log point predictive densities (ELPD LOO) for each model. These are shown below, transformed onto the information criterion scale.
(loo_1 <- loo(model_1))$estimates
(loo_2 <- loo(model_2))$estimates
(loo_3 <- loo(model_3))$estimates
(loo_4 <- loo(model_4))$estimates
lpd_point <- cbind(
loo_1$pointwise[,"elpd_loo"],
loo_2$pointwise[,"elpd_loo"],
loo_3$pointwise[,"elpd_loo"],
loo_4$pointwise[,"elpd_loo"]
)
lpd <- as.data.frame(lpd_point)
elpd_1 <- loo(model_1)$estimates[3]
elpd_2 <- loo(model_2)$estimates[3]
elpd_3 <- loo(model_3)$estimates[3]
elpd_4 <- loo(model_4)$estimates[3]
elpds <- cbind(elpd_1, elpd_2, elpd_3, elpd_4)
elpds <- as.data.frame(elpds)
formattable(elpds, align=c("l"), col.names=c("Model 1", "Model 2", "Model 3", "Model 4"))
| Model 1 | Model 2 | Model 3 | Model 4 |
|---|---|---|---|
| 698.5986 | 693.4692 | 659.8231 | 660.5028 |
For comparison, we’ll also display the WAIC for each model.
waic(model_1)
waic(model_2)
waic(model_3)
waic(model_4)
waic_1 <- waic(model_1)$estimates[3]
waic_2 <- waic(model_2)$estimates[3]
waic_3 <- waic(model_3)$estimates[3]
waic_4 <- waic(model_4)$estimates[3]
waics <- as.data.frame(cbind(waic_1, waic_2, waic_3, waic_4))
formattable(waics,align=c("l"), col.names=c("Model 1", "Model 2", "Model 3", "Model 4"))
| Model 1 | Model 2 | Model 3 | Model 4 |
|---|---|---|---|
| 698.5868 | 693.4144 | 659.1211 | 659.4947 |
Notably, the calculation of WAIC is unstable for these models, and shouldn’t be trusted.
With the ELPD calculations, the models are able to be ranked in order of their contribution, as a percentage. As shown, Model 3 has the highest weight with 0.534, followed by Model 4. By this metric, Model 3 appears to be the best model from the set of 4 we created above.
(weights <- pseudobma_weights(lpd_point))
## Method: pseudo-BMA+ with Bayesian bootstrap
## ------
## weight
## model1 0.000
## model2 0.002
## model3 0.549
## model4 0.449
A new type of candy is created (new_candy), which is turned into a dataframe for prediction. A table of predictions for all models is then created, titled predictions.
new_candy <- data.frame(chocolate = TRUE, fruity = TRUE, caramel = TRUE, peanutyalmondy = TRUE, nougat = TRUE, crispedricewafer = TRUE, hard = FALSE, bar = FALSE,
pluribus = FALSE, sugarpercent = 0.20, pricepercent = 0.9)
my_predict <- function(model) {
posterior_predict(model, newdata = new_candy)
}
make_df <- function(predictions, index) {
data.frame(winpercent_new = predictions[,1], model = index)
}
predictions <- map(model_list, my_predict) %>%
map2(1:4, make_df)
Using the weights we calculated from using the pseudobma_weights() function to obtain ELPD scores, we can then generate the predictive distributions for the win percentage of our made up candy, defined above.
sampled_pred <- predictions %>% #calculating sample predictions for each model
map2(weights, sample_frac) %>%
bind_rows() %>%
mutate(model = as.character(model))
predictions %>%
bind_rows() %>%
ggplot(aes(x = winpercent_new, fill = as.character(model), color = as.character(model))) +
geom_density_ridges(alpha = 0.6, aes(y = reorder(as.character(model), -model))) + labs(title = 'Posterior Predictive Distributions for The Four Models', y = "Model", x = "Predicted Percent of Wins") +
scale_fill_manual(values = pal2) +
scale_color_manual(values = pal2) +
theme(legend.position = "none")
Unsurprisingly, the different models provided slightly different predictions. The \(MAP\) winpercent for Model 1, Model 2 appear to be closer to 45-50%, while the Model 3, Model 4 show a ‘winpercent’ of closer to 80%.
sampled_pred %>%
ggplot(aes(x = winpercent_new, fill = as.character(model), color=as.character(model))) +
geom_histogram(alpha=0.8,
position = position_stack(),
binwidth = 5,
boundary = 0) +
labs(x = "Predicted Percent of Wins", y = "Draws", fill = "Model", color = "Model") +
scale_fill_manual(values = pal2) +
scale_color_manual(values = pal2)
As a more visual demonstration of the weights, the contribution of each model is evident in the predictive distribution, with model 3 contributing the largest, followed by model 4. Thus, it is not surprising that the pseudo-BMA model also has an \(MAP\) of around 80%, given the dominance of Model 3 in the weighted averaging.
As shown above, it is evident that the pseudo-BMA model’s prediction is heavily reliant on model 3 & 4, with similar \(MAP\) values of around 80% for the win percentage for the made up candy. To better understand the advantage of Pseudo-BMA, we can compare its predictions to that of model 3, the ‘best’ standalone model, using the ppc_intervals function.
The mean predictions for the Pseudo-BMA model were obtained by a weighted average of all model predictions. Then Model 3’s predictions could be compared visually to the Pseudo-BMA model:
pred3 <- colMeans(candy_predictions3) #needed to take the mean for plotting, below
ppc_intervals(candy$winpercent[1:20],
yrep = full_pred_mat[,1:20], prob_outer = 0.95) +
ggplot2::scale_x_continuous(
labels = candy$competitorname[1:20],
breaks = 1:20
) +
lims(y = c(5, 90)) +
geom_point(aes(x = c(1:20), y = pred3[1:20]), color = 'red') +
labs(title ="Pseudo-BMA model vs Model 3 \nOn The First 20 Candies",
subtitle = 'Red Dots are Model 3 average predictions',
x = NULL,
y = "Win Percent") +
coord_flip()
As shown by the plot of the first 20 candies, the predictions for model 3 and the pseudo-BMA are fairly close, which makes sense given th weighted average nature of the pseudo-BMA model. It is interesting to note the variation in these model results, however, such as with Dum Dums, Fun Dip, we see that model 3 is more accurate in predicting win percentage. Out of these visual summary, however, the data seems to indicate that the pseudo-BMA model was on average much closer to the actual win percentage than model 3.
bind_rows(prediction_summary(y = candy$winpercent,
yrep = candy_predictions3) %>%
mutate(model = "Model 3") %>%
select(model, everything()),
prediction_summary(y = candy$winpercent,
yrep = full_pred_mat) %>%
mutate(model = "Combined Model")
) %>%
kable()
| model | mae | mae_scaled | within_50 | within_95 |
|---|---|---|---|---|
| Model 3 | 6.685484 | 1.835887 | 0.1882353 | 0.5058824 |
| Combined Model | 6.388162 | 1.676153 | 0.2470588 | 0.5647059 |
While hard to tell which is better visually, prediction_summary() plainly indicates the better model. Due to the use of a weighted average, Pseudo BMA is evidently more accurate than model 3. On average, pseudo-BMA was closer to the actual win percentage of a candy than model 3, (MAE), and 57.6% of the candy’s win percentage is within the middle 95% of pseudo-BMA’s predictive distribution. While the pseudo-BMA predictions are better than the individual models, far more of the candies than should be are outside of the 95% intervals. This means the model vastly overestimates its accuracy, and perhaps another model entirely should be used.
The concept of taking a weighted average of the predictions from multiple models is not a new technique, given the obvious benefits of increased flexibility and robustness to predictions. However, weighted averages are predominantly seen in Frequentist approaches to modelling, and less so in the Bayesian approach. Bayesian Model Averaging, and more specifically Pseudo-BMA, is a more computationally efficient method to evaluate each model’s predictive distribution, assigning ELPD scores for each model. ELPD, as fractional indication of model weight, allows for predictions from each model to be combined in a manner that provides the optimal predictive distribution, a Pseudo-BMA model. As shown by the candy dataset, Pseudo-BMA naturally outperforms the best single-model calculated, and could only improve with even more models tested.
mcmc_trace(model_2)
mcmc_dens_overlay(model_2)
mcmc_trace(model_3)
mcmc_dens_overlay(model_3)
mcmc_trace(model_4)
mcmc_dens_overlay(model_4)
Models graphed in that order:
candy_plotter(model_1)
candy_plotter(model_2)
candy_plotter(model_4)
Fragoso, T.M., W. Bertoli, and F. Louzada. 2018. “Bayesian Model Averaging: A Systematic Review and Conceptual Classification.” International Statistical Review. https://doi.org/10.1111/insr.12243.
Montgomery, Jacob, and Brendan Nyhan. 2010. “Bayesian Model Averaging: Theoretical Developments and Practical Applications.” Political Analysis.
Statistics How To. 2020. “Bayesian Information Criterion (Bic) / Schwarz Criterion .” https://www.statisticshowto.com/bayesian-information-criterion/.
Watanbe, Sumio. 2013. “A Widely Applicable Bayesian Information Criterion.” Journal of Machine Learning Research.
Yao, Yuling, Aki Vehtari, Daniel Simpson, and Andrew Gelman. 2018. “Using Stacking to Average Bayesian Predictive Distributions (with Discussion).” Bayesian Analysis.
Zajic, Alexandre. 2019. “Introduction to Aic — Akaike Information Criterion.” https://towardsdatascience.com/introduction-to-aic-akaike-information-criterion-9c9ba1c96ced.